Sales Performance and Customer Behavior Analysis based on Superstore Dataset¶

Table of Contents¶

  • Introduction
    • Data Source
    • Purpose
  • Data Loading
  • Data Analysis
    • Calculate sales and profits by category
    • Profit Margin Analysis
    • Sub-Category Analysis
    • Trend Analysis
    • Impact of Discounts on Sales and Profits
    • Customer Purchasing Behavior Analysis
    • Customer loyalty
    • Word Cloud for Purchased Product Names

Introduction¶

With growing demands and cut-throat competitions in the market, a Superstore Giant is seeking the knowledge in understanding what works best for them. They would like to understand which products, regions, categories and customer segments they should target or avoid.

Data Source¶

https://www.kaggle.com/datasets/vivek468/superstore-dataset-final/data

Purpose¶

Calculate Sales and Profits by Category: To evaluate how different product categories contribute to overall sales and profits. This allows for strategic decisions on product focus and inventory management.

Profit Margin Analysis: To determine the profitability of products by analyzing profit margins. This helps in pricing strategy and cost management.

Sub-Category Analysis: To drill down into product sub-categories for more detailed insights. This assists in understanding specific market trends and consumer preferences.

Trend Analysis: To identify sales and profit trends over time, such as seasonal patterns or year-over-year growth. This information is crucial for forecasting and planning future sales strategies.

Impact of Discounts on Sales and Profits: To assess how discounts affect sales volumes and profit margins. This analysis helps in optimizing discount strategies to maximize revenue without eroding profit margins.

Customer Purchasing Behavior Analysis: To analyze the purchasing patterns and behaviors of different customer segments. This provides insights into customer needs and can inform targeted marketing campaigns.

Customer Loyalty: To assess customer loyalty by analyzing repeat purchase rates and customer retention metrics. This helps in understanding the customer relationship and identifying opportunities to enhance customer loyalty programs.

Word Cloud for Purchased Product Names: To visual representation of frequently purchased product names to identify popular items and trends among customers.

Data Loading¶

In [24]:
import os
from google.colab import drive
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.palettes import Category10
from wordcloud import WordCloud
from bokeh.resources  import settings

settings.resources = 'inline'
In [25]:
drive.mount('/content/drive')
os.chdir("/content/drive/My Drive/DA/SuperStore")
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [26]:
data = pd.read_csv("Superstore.csv",encoding='latin1')
In [27]:
data.head()
Out[27]:
Row ID Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment Country City ... Postal Code Region Product ID Category Sub-Category Product Name Sales Quantity Discount Profit
0 1 CA-2016-152156 11/8/2016 11/11/2016 Second Class CG-12520 Claire Gute Consumer United States Henderson ... 42420 South FUR-BO-10001798 Furniture Bookcases Bush Somerset Collection Bookcase 261.9600 2 0.00 41.9136
1 2 CA-2016-152156 11/8/2016 11/11/2016 Second Class CG-12520 Claire Gute Consumer United States Henderson ... 42420 South FUR-CH-10000454 Furniture Chairs Hon Deluxe Fabric Upholstered Stacking Chairs,... 731.9400 3 0.00 219.5820
2 3 CA-2016-138688 6/12/2016 6/16/2016 Second Class DV-13045 Darrin Van Huff Corporate United States Los Angeles ... 90036 West OFF-LA-10000240 Office Supplies Labels Self-Adhesive Address Labels for Typewriters b... 14.6200 2 0.00 6.8714
3 4 US-2015-108966 10/11/2015 10/18/2015 Standard Class SO-20335 Sean O'Donnell Consumer United States Fort Lauderdale ... 33311 South FUR-TA-10000577 Furniture Tables Bretford CR4500 Series Slim Rectangular Table 957.5775 5 0.45 -383.0310
4 5 US-2015-108966 10/11/2015 10/18/2015 Standard Class SO-20335 Sean O'Donnell Consumer United States Fort Lauderdale ... 33311 South OFF-ST-10000760 Office Supplies Storage Eldon Fold 'N Roll Cart System 22.3680 2 0.20 2.5164

5 rows × 21 columns

In [28]:
# View data information
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Row ID         9994 non-null   int64  
 1   Order ID       9994 non-null   object 
 2   Order Date     9994 non-null   object 
 3   Ship Date      9994 non-null   object 
 4   Ship Mode      9994 non-null   object 
 5   Customer ID    9994 non-null   object 
 6   Customer Name  9994 non-null   object 
 7   Segment        9994 non-null   object 
 8   Country        9994 non-null   object 
 9   City           9994 non-null   object 
 10  State          9994 non-null   object 
 11  Postal Code    9994 non-null   int64  
 12  Region         9994 non-null   object 
 13  Product ID     9994 non-null   object 
 14  Category       9994 non-null   object 
 15  Sub-Category   9994 non-null   object 
 16  Product Name   9994 non-null   object 
 17  Sales          9994 non-null   float64
 18  Quantity       9994 non-null   int64  
 19  Discount       9994 non-null   float64
 20  Profit         9994 non-null   float64
dtypes: float64(3), int64(3), object(15)
memory usage: 1.6+ MB
In [29]:
# Check for missing values
data.isnull().sum()
Out[29]:
Row ID           0
Order ID         0
Order Date       0
Ship Date        0
Ship Mode        0
Customer ID      0
Customer Name    0
Segment          0
Country          0
City             0
State            0
Postal Code      0
Region           0
Product ID       0
Category         0
Sub-Category     0
Product Name     0
Sales            0
Quantity         0
Discount         0
Profit           0
dtype: int64
In [30]:
# Display information for all types of features
data.describe(include=['O'])

#count: Number of non-null entries.
#unique: Number of unique values.
#top: Most frequent value.
#freq: Frequency of the most frequent value.
Out[30]:
Order ID Order Date Ship Date Ship Mode Customer ID Customer Name Segment Country City State Region Product ID Category Sub-Category Product Name
count 9994 9994 9994 9994 9994 9994 9994 9994 9994 9994 9994 9994 9994 9994 9994
unique 5009 1237 1334 4 793 793 3 1 531 49 4 1862 3 17 1850
top CA-2017-100111 9/5/2016 12/16/2015 Standard Class WB-21850 William Brown Consumer United States New York City California West OFF-PA-10001970 Office Supplies Binders Staple envelope
freq 14 38 35 5968 37 37 5191 9994 915 2001 3203 19 6026 1523 48

Data Analysis¶

Calculate sales and profits by category¶

In [31]:
# Aggregate sales and profits by category
category_sales = data.groupby('Category')['Sales'].sum()
category_profits = data.groupby('Category')['Profit'].sum()
category_sales
Out[31]:
Category
Furniture          741999.7953
Office Supplies    719047.0320
Technology         836154.0330
Name: Sales, dtype: float64
In [32]:
# Plotting pie chart for sales by category
plt.figure(figsize=(14, 7))

plt.subplot(1, 2, 1)
category_sales.plot(kind='pie', autopct='%1.1f%%', startangle=140)
plt.title('Sales by Category')
plt.ylabel('')

# Plotting pie chart for profits by category
plt.subplot(1, 2, 2)
category_profits.plot(kind='pie', autopct='%1.1f%%', startangle=140)
plt.title('Profits by Category')
plt.ylabel('')

plt.tight_layout()
plt.show()
No description has been provided for this image

Summary: The pie charts provided show the distribution of sales and profits across different product categories.

The Technology category is both the highest in sales and profits, indicating strong market performance and profitability.

While Furniture has a significant share of sales, its contribution to profits is minimal, suggesting lower profit margins or higher costs associated with this category.

Office Supplies performs well in both sales and profits, indicating a balanced contribution to the overall performance.

Profit Margin Analysis¶

In [33]:
# Calculate profit margin
data['Profit Margin'] = data['Profit'] / data['Sales']

# Aggregate profit margins by category
category_profit_margin = data.groupby('Category')['Profit Margin'].mean()
print(category_profit_margin)
Category
Furniture          0.038784
Office Supplies    0.138030
Technology         0.156138
Name: Profit Margin, dtype: float64
In [34]:
# Plotting pie chart for profit margin by category
plt.figure(figsize=(7, 7))

category_profit_margin.plot(kind='pie', autopct='%1.1f%%', startangle=140)
plt.title('Profit Margin by Category')
plt.ylabel('')
Out[34]:
Text(0, 0.5, '')
No description has been provided for this image

Summary:

Technology not only contributes the most to overall profits but also has the highest profit margin. This indicates that it is both a high-revenue and high-profit category.

Office Supplies has a moderate profit margin, making it a balanced category in terms of both revenue and profitability.

Furniture has the lowest profit margin, suggesting that despite its sales, the profits generated from this category are relatively low. This might be due to higher costs associated with furniture or lower pricing strategies.

Sub-Category Analysis¶

In [35]:
# Aggregate sales and profits by sub-category
subcategory_sales = data.groupby('Sub-Category')['Sales'].sum()
subcategory_profits = data.groupby('Sub-Category')['Profit'].sum()

# Create a DataFrame from the aggregated series
df = pd.DataFrame({
    'Sales': subcategory_sales,
    'Profit': subcategory_profits
})
In [36]:
# Sort DataFrame by Sales for better visualization
df_sorted = df.sort_values(by='Sales', ascending=False)
fig, ax = plt.subplots(figsize=(12, 8))
bar_width = 0.4
# Positions of the bars on x-axis
r = range(len(df_sorted))
# Plot bars for Sales
bars1 = ax.bar(r, df_sorted['Sales'], color='b', width=bar_width, edgecolor='grey', label='Sales')
# Plot bars for Profit
bars2 = ax.bar([i + bar_width for i in r], df_sorted['Profit'], color='r', width=bar_width, edgecolor='grey', label='Profit')
# Set labels and title
ax.set_xlabel('Sub-Category', fontweight='bold')
ax.set_ylabel('Amount', fontweight='bold')
ax.set_xticks([i + bar_width / 2 for i in r])
ax.set_xticklabels(df_sorted.index, rotation=90)
ax.set_title('Total Sales and Profits by Sub-Category')

# Add value labels on bars
for bars in [bars1, bars2]:
    for bar in bars:
        yval = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2, yval + 500, round(yval, 2), ha='center', va='bottom', fontsize=8)

ax.legend()
plt.tight_layout()
plt.show()
No description has been provided for this image

Summary:

Profitable Categories: Electronics such as Phones and Copiers contribute significantly to overall profitability.

Loss-making Categories: Furniture items like Tables, Supplies, and Bookcases are generating losses, requiring further review of pricing or cost structures.

Trend Analysis¶

Analyze sales and profit trends over time.

In [37]:
data['Order Date'] = pd.to_datetime(data['Order Date'])

# Set Order Date as index
data.set_index('Order Date', inplace=True)

# Resample data to get monthly sales and profit
monthly_sales = data.resample('M')['Sales'].sum()
monthly_profits = data.resample('M')['Profit'].sum()
In [38]:
# Create a ColumnDataSource
source = ColumnDataSource(data={
    'date': monthly_sales.index,
    'sales': monthly_sales.values,
    'profits': monthly_profits.values
})

# Create the figure
p = figure(height=400, width=800, x_axis_type='datetime', title='Monthly Sales and Profits')

# Plotting monthly sales
p.line('date', 'sales', source=source, line_width=2, line_color=Category10[3][0], legend_label='Monthly Sales')
p.circle('date', 'sales', source=source, size=8, color=Category10[3][0], legend_label='Monthly Sales')

# Plotting monthly profits
p.line('date', 'profits', source=source, line_width=2, line_color=Category10[3][1], legend_label='Monthly Profits')
p.circle('date', 'profits', source=source, size=8, color=Category10[3][1], legend_label='Monthly Profits')

# Adding HoverTool with tooltips
hover = HoverTool(tooltips=[
    ('Date', '@date{%F}'),
    ('Sales', '@sales{$0.00}'),
    ('Profits', '@profits{$0.00}')
], formatters={'@date': 'datetime'})

p.add_tools(hover)

# Styling and layout adjustments
p.legend.location = 'top_left'
p.legend.click_policy = 'hide'
output_notebook()
# Show the plot
show(p)

Summery:

Sales trends with notable seasonal patterns, including peak sales in late-year months like September and November.

Profitability fluctuates monthly, sometimes showing negative margins despite high sales.

Yearly trends show overall growth, with 2017 notably peaking in November.

Impact of Discounts on Sales and Profits¶

In [39]:
discountsSalesProfit = data[['Discount', 'Sales', 'Profit']]
discountsSalesProfit = discountsSalesProfit.groupby(['Discount']).mean().reset_index()

# Extract data for plotting
discount = discountsSalesProfit['Discount']
sales_mean = discountsSalesProfit['Sales']
profits_mean = discountsSalesProfit['Profit']

# Plotting
plt.figure(figsize=(10, 6))

# Plotting Sales
plt.plot(discount, sales_mean, marker='o', linestyle='-', color='blue', label='Average Sales')

# Plotting Profits
plt.plot(discount, profits_mean, marker='o', linestyle='-', color='green', label='Average Profits')

# Adding labels and title
plt.title('Average Sales and Profits vs. Discount')
plt.xlabel('Discount')
plt.ylabel('Amount ($)')
plt.grid(True)
plt.legend()

# Display plot
plt.show()
No description has been provided for this image
In [39]:

Customer Purchasing Behavior Analysis¶

In [40]:
#  purchasing behavior by region
region_sales = data.groupby('Region')['Sales'].sum()
region_profits = data.groupby('Region')['Profit'].sum()

print(region_sales)
print(region_profits)
Region
Central    501239.8908
East       678781.2400
South      391721.9050
West       725457.8245
Name: Sales, dtype: float64
Region
Central     39706.3625
East        91522.7800
South       46749.4303
West       108418.4489
Name: Profit, dtype: float64
In [41]:
# Plotting Sales
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)  # Subplot for Sales
plt.pie(region_sales, labels=region_sales.index, autopct='%1.1f%%', startangle=140)
plt.title('Sales Distribution by Region')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

# Plotting Profits
plt.subplot(1, 2, 2)  # Subplot for Profits
plt.pie(region_profits, labels=region_profits.index, autopct='%1.1f%%', startangle=140)
plt.title('Profit Distribution by Region')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.tight_layout()
plt.show()
No description has been provided for this image
In [42]:
#purchasing behavior by customer segment
segment_sales = data.groupby('Segment')['Sales'].sum()
segment_profits = data.groupby('Segment')['Profit'].sum()

print(segment_sales)
print(segment_profits)
Segment
Consumer       1.161401e+06
Corporate      7.061464e+05
Home Office    4.296531e+05
Name: Sales, dtype: float64
Segment
Consumer       134119.2092
Corporate       91979.1340
Home Office     60298.6785
Name: Profit, dtype: float64
In [43]:
# Plotting Sales
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)  # Subplot for Sales
plt.pie(segment_sales, labels=segment_sales.index, autopct='%1.1f%%', startangle=140)
plt.title('Sales Distribution by Segment')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

# Plotting Profits
plt.subplot(1, 2, 2)  # Subplot for Profits
plt.pie(segment_profits, labels=segment_profits.index, autopct='%1.1f%%', startangle=140)
plt.title('Profit Distribution by Segment')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.tight_layout()
plt.show()
No description has been provided for this image

Customer loyalty¶

In [44]:
# Counting orders per customer
loyalty_counts = data.groupby('Customer ID')['Order ID'].count()
loyalty_counts
Out[44]:
Customer ID
AA-10315    11
AA-10375    15
AA-10480    12
AA-10645    18
AB-10015     6
            ..
XP-21865    28
YC-21895     8
YS-21880    12
ZC-21910    31
ZD-21925     9
Name: Order ID, Length: 793, dtype: int64
In [45]:
# Plotting
plt.figure(figsize=(10, 6))
plt.hist(loyalty_counts, bins=range(1, loyalty_counts.max() + 2), align='left', edgecolor='black')
plt.xlabel('Number of Orders')
plt.ylabel('Number of Customers')
plt.title('Customer Loyalty')
plt.xticks(range(1, loyalty_counts.max() + 1))
plt.grid(axis='y', alpha=0.75)

plt.show()
No description has been provided for this image

Word Cloud for Purchased Product Names¶

In [46]:
# Combine the product names into a single string
text = ' '.join(data['Product Name'])

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

# Display the word cloud
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
No description has been provided for this image

Summary: The word cloud summarizes the dataset by highlighting frequently occurring terms, which can be useful for identifying popular product features at a glance.